Abstract

The paper presents a simple model for recovering affine shape and correspondence
from two orthographic views of a three-dimensional object. The
paper has two parts. In the first part it is shown that four corresponding
points along two orthographic views, taken under similar illumination conditions,
determine affine shape and correspondence for all other points. In
the second part it is shown that the scheme is useful for purposes of visual
recognition by generating novel views of an object given two model views in
full correspondence and four corresponding points between the model views
and the novel view. It is also shown that the scheme can handle objects
with smooth boundaries, to a good approximation, without introducing any
modifications or additional model views.

1 Introduction


Structure from motion (SFM) and visual recognition are intimately related. Recovering
the structure of a moving three-dimensional (3D) object from its changing
2D image is dual to the problem of identifying images of an object viewed from a
variety of vantage points, as instances of the same 3D object. Both require an understanding
of the relationship between the 3D world and its 2D projections, both
start with the same input and both work with essentially the same ingredients: 3D 
structure of the object, motion or viewing transformation applied to the object, 
and the pointwise correspondence between two or more views of the object. 

In SFM one generally wants to recover information that was lost in the course of 
projection from 3D to 2D. This includes the 3D Euclidean structure of the object 
and the 3D motion transformation from one time instance to the next. Visual 
recognition confronts the same issues but in a more implicit manner. Rather than 
recovering 3D information, one is more concerned with factoring its effects out, i.e.
the effect of shape and viewing transformation, thereby reducing all views of an 
object to a canonical view (or set of views) that represents the object. 

Previous approaches to 3D interpretation traditionally assume that correspondence 
between 2D views is known, or can be measured independently [4, 55, 23, 3, 21, 
31]. Under perspective projection it has been shown that two views undergoing
infinitesimal motion are, in principle, sufficient to recover shape and motion [41,
30, 55, 54, 36]; however, the process is inherently susceptible to noise [17, 2, 46,
11]. Under orthographic projection, it has been shown that at least three views,
undergoing general motion, are required to recover the same information [48, 49, 
25, 8, 47]. In object recognition, the approach that seems most relevant to known 
results from structure from motion is the alignment approach [16, 19, 20, 50, 26]. 
Under this framework it has been shown [50, 26] that a 3D model together with 
a small number of corresponding points are sufficient for predicting novel views of 
the rigid object, and recently that shape information can be represented [51], or 
approximated [18], by having instead a set of 2D views of the object. 

The approach to SFM in this study is different from most past approaches in that 
it is guided by a specific goal — performing visual recognition. This implies that 
information to be recovered from the changing 2D image should be no more than 
what is necessary to perform visual recognition. Instead of recovering Euclidean 
shape 1 and 3D motion parameters, the emphasis here is to recover affine shape 2
and full correspondence between two orthographic views, given limited information
regarding motion parameters, namely the information that is captured by having four
corresponding points between the two frames.

1 3D coordinates relative to a Cartesian frame aligned with the viewer’s coordinate system
and with the line of sight.

2 3D coordinates relative to a frame defined by an arbitrary set of four non-coplanar points on
the object.

The reason for the emphasis on recovering affine shape is twofold. It will be shown 
that affine shape recovered from the correspondence between two model views is 
sufficient for purposes of recognition — one can generate novel views (excluding 
occlusion) of the object undergoing arbitrary 3D affine transformations, given four 
corresponding points with the novel view. Furthermore, affine shape seems to play 
an important role in the perception of kinetic depth displays, even in cases where 
Euclidean shape can theoretically be recovered, as suggested in [45]. 

The emphasis on solving the correspondence problem is inspired by recent developments
in visual recognition using alignment [51, 43] and Radial Basis Functions
[18] indicating that establishing correspondence between two or more views is a
major step towards ameliorating the effects of changing view position and illumination
conditions. The main new results presented in this study include the
following: 

• Four corresponding points along two orthographic views, taken under similar
illumination conditions, together with the instantaneous brightness measurements
are sufficient to completely determine, without regularizing assumptions,
correspondence and affine shape along all other points in the image.

• The information carried by the four corresponding points can be succinctly 
represented by a 2D affine transformation that serves as a constraint line in 
correspondence space. The scale factor associated with the affine displacement
vector is a shape parameter representing the relative deviation, along
the line of sight, of an object point from a reference plane defined by three 
of the corresponding points. This result is new in its algebraic aspect; the 
concept of representing affine shape as a deviation from a reference plane 
was recently introduced by Koenderink and Van-Doorn [28]. 

• The computational study suggests that the measurement of motion starts 
by setting up a frame of reference determined by a small number of salient, 
unambiguously matched, features. The frame provides a nominal motion, 
which is exact for planar surfaces, and which ‘pulls’ or ‘captures’ all other 
points in that frame. The remaining residual motion is later refined by use 
of local spatio-temporal detectors that are tuned along a known direction 
which is determined by the frame of reference. 

• The result that correspondence can be recovered from two views under similar
illumination conditions suggests that small changes of view position can
be factored out in the course of recognition, using only a single picture of
the object as a model. Another result is that affine shape recovered from
the correspondence between two model views can be used to generate novel
views of the object undergoing arbitrary 3D affine transformations, given four 
corresponding points with the novel view. It is also shown that this result 
applies to objects with smooth boundaries, to a good approximation, without 
introducing additional model views. (Objects with smooth boundaries, such 
as ellipsoids or spheres, are more complex because the object’s boundary 
contour is not projected from fixed contours on the object [7, 27]). 

The remainder of this section presents the results concerning establishing correspondence
and affine shape from two orthographic views (the first three items
above). Section 2 puts these results in the context of visual recognition (fourth 
item above). 

1.1 Shape and Correspondence from 2 Views 

We assume that orthographic views at time instances $t_1$ and $t_2$ are taken of a surface
in 3D space. We assume the convention that the 3D Cartesian frame is aligned
with the x-y axes in image space, and that the z axis is along the viewer’s optical
axis. Furthermore, without loss of generality, we assume that the origin of the 3D
frame is aligned with the point (0, 0) in the image plane. The following notation
is used. Let P be a point in 3D space at time $t_1$, and $p = \sigma[P]$ be its orthographic
projection onto the image plane. Let P' be the location of the point P at time
$t_2$, and $p' = \sigma[P']$ be the image space coordinates of P'. We therefore refer to the
pair p and p' as corresponding points. Let op = p − o denote the vector from the
point o to p, i.e. op represents the coordinates of p with respect to a new origin
located at point o. Similarly OP, O'P', o'p' denote the vectors from O to P, from
O' to P' and from o' to p', respectively. A point p will be referred to as privileged
if its corresponding point p' is given as input.

Let $O, P_1, P_2, P_3$ be four non-coplanar reference points 3 on an object of interest
in 3D. Taking O to be the origin, we obtain a 3D affine coordinate frame, and
therefore, any point P on the object can be represented in the affine coordinate
frame with its associated set of coordinates $b_1, b_2, b_3$ in the following way:

$$OP = \sum_{j=1}^{3} b_j\,(OP_j).$$

The crucial point is that the $b_j$'s are invariant with respect to linear transformations
applied to the equation above (which correspond to affine transformations in space 
that include rotation, translation, scaling and shearing of the object). 
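
The invariance can be checked numerically. The following is a minimal Python sketch, not part of the original derivation: it computes the affine coordinates of a point relative to four reference points and recomputes them after an invertible affine map. The helper names (`affine_coords`, `apply_affine`) and all numeric values are illustrative assumptions; exact rational arithmetic is used so that the comparison is not affected by rounding.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule (exact for rational input)."""
    d = det3(rows)
    if d == 0:
        raise ValueError("singular system: reference points are coplanar")
    sol = []
    for k in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        sol.append(Fraction(det3(m)) / Fraction(d))
    return sol

def affine_coords(O, P1, P2, P3, P):
    """Coordinates b_j satisfying OP = sum_j b_j (OP_j)."""
    cols = [[Pj[i] - O[i] for i in range(3)] for Pj in (P1, P2, P3)]
    rows = [[cols[j][i] for j in range(3)] for i in range(3)]
    return solve3(rows, [P[i] - O[i] for i in range(3)])

def apply_affine(T, t, X):
    """Affine map X -> T X + t in 3D."""
    return [sum(T[i][k] * X[k] for k in range(3)) + t[i] for i in range(3)]

# Hypothetical reference points and object point.
O, P1, P2, P3 = [1, 2, 3], [4, 2, 5], [1, 6, 2], [2, 3, 9]
P = [3, 5, 7]
# Hypothetical invertible linear part (with shear and scaling) and translation.
T = [[2, 1, 0], [0, 1, 3], [1, 0, 1]]
t = [5, -1, 2]

b_before = affine_coords(O, P1, P2, P3, P)
moved = [apply_affine(T, t, X) for X in (O, P1, P2, P3, P)]
b_after = affine_coords(*moved)
print(b_before == b_after)
```

The same coordinates are obtained before and after the map because the map acts linearly on the difference vectors OP and OP_j, which is the content of the statement above.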

3 The term reference point is adopted from projective geometry (see [42]).

Let the object undergo an arbitrary affine transformation in space, and let $O', P'_1, P'_2, P'_3$
and P' be the new space locations of the affine coordinate frame and the point of
interest P. We therefore have:

$$O'P' = \sum_{j=1}^{3} b_j\,(O'P'_j).$$

Under orthographic projection, we have the following relation between the image
coordinates of the affine frame in both views, and the image coordinates of the
point of interest in both views:

$$op = \sum_{j=1}^{3} b_j\,(op_j) \tag{1}$$

$$o'p' = \sum_{j=1}^{3} b_j\,(o'p'_j) \tag{2}$$


The four equations in formulas 1 and 2 combine together shape, i.e. affine coordinates,
projected motion, i.e. motion of four points, and correspondence. Therefore,
given the projected motion, captured by four corresponding points, and the affine
coordinates we can immediately obtain correspondence as well. Also, given the
correspondence p ↔ p' we have 4 equations for 3 affine coordinates, which also
shows that a ‘view and a half’ is sufficient for recovering affine shape (see also [35,
51]). Note also that formula 1 provides two equations for solving for the affine
coordinates; the third equation has been lost because of the projection from 3D
to 2D.

We can compensate for the loss of the third equation by producing an equation
directly from the changing brightness 4. We assume that both views are taken
under identical illumination conditions, namely, that brightness change is induced 
purely by motion and not by photometric effects of changing viewing angle or angle 
between light sources and surface orientation. In other words, we assume that the 
brightness of an image point p is equal to the brightness of its corresponding point 
p' in the second view (Horn and Schunck [23]). By further assuming that the motion
is infinitesimal (an assumption that will be relaxed later on), we obtain from the 
expansion of the total derivative of brightness at p a linear approximation to the 
change of brightness due to motion, known as the constant brightness equation [23]: 

$$\nabla I \cdot v + I_t = 0$$

where v = p' − p is the unknown displacement vector, $\nabla I$ is the gradient at point p
in the image of the first view, and $I_t$ is the temporal derivative at p. The constant
brightness equation provides only one component of the displacement vector v, the
component along the gradient direction, or normal to the isobrightness contour at
p. This ‘normal flow’ information, provided by the changing brightness, is sufficient
to uniquely determine the affine coordinates $b_j$ at p, as shown next.

4 The term ‘brightness’ has different meanings in vision literature. Here it refers to the
raw image intensities (term adopted from Horn [22]).

By subtracting equation 1 from equation 2 we get the following relation:

$$v = \sum_{j=1}^{3} b_j v_j + \Big(1 - \sum_{j=1}^{3} b_j\Big) v_0 \tag{3}$$

where $v_j$, j = 0, ..., 3, are the known displacement vectors of the privileged points
(with $v_0 = o' - o$ the displacement of the origin point). By
substituting equation 3 in the constant brightness equation we get a new equation
in which the affine coordinates are the only unknowns:

$$\sum_{j=1}^{3} b_j \left[\nabla I \cdot (v_j - v_0)\right] + I_t + \nabla I \cdot v_0 = 0. \tag{4}$$

Equations 1 and 4 provide a complete set of linear equations (ignoring singular
cases) to solve for the affine coordinates from which, in turn, we obtain correspondence.
We have therefore proven the following ‘4pt + brightness’ proposition:

Proposition 1 (4pt + brightness) Two orthographic images of a shaded 3D 
surface with four clearly marked reference points, admit a complete set of linear 
equations representing the affine coordinates of all surface points (excluding singular
cases), provided that the surface is undergoing an infinitesimal affine transformation
and the two orthographic images are taken under identical illumination
conditions. 
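
The linear system behind Proposition 1 can be sketched in a few lines of Python. The sketch below is an illustration under stated assumptions, not the implementation used in this study: the function name `four_point_brightness` and all input values are hypothetical, and the temporal derivative is generated synthetically from the linearized brightness equation, so the example exercises only the algebra of equations 1, 2 and 4.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(rows)
    if d == 0:
        raise ValueError("singular case: the three equations are dependent")
    sol = []
    for k in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        sol.append(Fraction(det3(m)) / Fraction(d))
    return sol

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def dot(a, b):
    return a[0] * b[0] + a[1] * b[1]

def four_point_brightness(o, ps, o2, ps2, p, grad, It):
    """Solve equations 1 and 4 for b, then apply equation 2 to obtain p'."""
    v0 = sub(o2, o)                                  # displacement of the origin point
    vs = [sub(q2, q) for q, q2 in zip(ps, ps2)]      # displacements v_1..v_3
    opj = [sub(q, o) for q in ps]
    op = sub(p, o)
    rows = [
        [c[0] for c in opj],                         # equation 1, x component
        [c[1] for c in opj],                         # equation 1, y component
        [dot(grad, sub(vj, v0)) for vj in vs],       # equation 4
    ]
    rhs = [op[0], op[1], -It - dot(grad, v0)]
    b = solve3(rows, rhs)
    o2pj = [sub(q2, o2) for q2 in ps2]
    p2 = tuple(o2[i] + sum(bj * c[i] for bj, c in zip(b, o2pj)) for i in range(2))
    return b, p2

# Hypothetical privileged points in the two views.
o, ps = (0, 0), [(4, 0), (0, 4), (1, 1)]
o2, ps2 = (1, 0), [(5, 1), (1, 4), (3, 2)]
# Point of interest, its brightness gradient, and a synthetic temporal derivative.
p, grad, It = (3, 2), (1, 2), -5

b, p2 = four_point_brightness(o, ps, o2, ps2, p, grad, It)
print(b, p2)
```

Singular cases show up as a zero determinant, which happens for instance when the gradient vanishes or is perpendicular to every difference $v_j - v_0$.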


Comments 


Rigidity: note that rigidity is not required for solving for affine coordinates and
correspondence. If correspondence is the main concern, say for model building
[51, 43, 18, 6], then assuming the transformation between the two views to be
any linear transformation allows one to tolerate certain non-rigid transformations,
as long as the field of view is sufficiently small. This may also be relevant for a
surface undergoing a rigid transformation but viewed under situations that do not 
fully meet the requirements of the orthographic projection model. This notion, 
however, is not pursued further here. 

Identical Illumination Conditions: the assumption of identical illumination conditions
is a useful approximation for a Lambertian surface under multiple light
sources or hemispherical illumination. In those cases the change of brightness due
to motion in space is much larger than the change in brightness induced by photometric
effects, such as changing viewing direction or illumination. The assumption
holds exactly for an object rotating around the vertical axis under hemispherical 
illumination, a situation which is quite common in natural environments. (See also 
[24, 53, 34] for quantitative and experimental analysis). Local photometric effects 
can also be ameliorated to some degree by applying a linear operator, such as the 
Laplace operator, to the brightness values, prior to using the constant brightness 
equation (Bergen and Adelson [9]). 

1.2 Constraint Lines in Correspondence Space 

The system of equations leading to Proposition 1 can be decomposed into two 
constraint lines intersecting at p' for any given point p. One constraint line comes 
directly from the constant brightness equation: a line passing through the point
$p - (I_t / |\nabla I|^2)\,\nabla I$ in the direction perpendicular to the gradient $\nabla I$ at point
p. The second constraint line can be derived from equations 1 and 2 as shown
below. 

We rewrite equations 1 and 2 in matrix form. Let M be a 2 × 3 matrix whose
column vectors are $op_1, op_2, op_3$, and similarly let M' have $o'p'_1, o'p'_2, o'p'_3$ as column
vectors. We therefore have op = Mb and o'p' = M'b. Since the system op = Mb
is underdetermined, the solution b is determined only up to an element of the
null space of M; namely, for every solution r the vector r + αs is also a solution,
where α is a scale factor and Ms = 0. We can substitute r + αs for b in the system
o'p' = M'b and obtain the following constraint line equation:

$$p' = o' + M'r + \alpha M's = r + \alpha s, \tag{5}$$

where, with a slight abuse of notation, r and s on the right-hand side stand for the
image-plane vectors o' + M'r and M's. Note that r depends on p whereas s is fixed
for all points; therefore the constraint
lines passing through all points in the image of the moving surface are parallel to
each other. The unknown parameter α can be found by using the first constraint
line whenever the gradient is non-vanishing and is not perpendicular to s, i.e. s
is not in the direction of the isobrightness contour at p (see Fig. 1). We have
therefore proven the following proposition:

Proposition 2 A constraint line in correspondence space can be recovered from 
four corresponding points along two orthographic views of an object undergoing an 
arbitrary affine transformation in space. 
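
The construction of the constraint line can be sketched as follows. This is a hedged Python illustration rather than the study's implementation: the function names are hypothetical, the particular solution r is taken with $r_3 = 0$ (which assumes that $op_1$ and $op_2$ are not collinear), and the numeric values are made up. The null vector s of M is obtained as the cross product of the two rows of M.

```python
from fractions import Fraction

def sub(a, b):
    return (a[0] - b[0], a[1] - b[1])

def null_vector(M):
    """Null vector of a 2x3 matrix: the cross product of its two rows."""
    (a, b, c), (d, e, f) = M
    return (b * f - c * e, c * d - a * f, a * e - b * d)

def constraint_line(o, ps, o2, ps2, p):
    """Return (base, direction) of the line p' = base + alpha * direction."""
    opj = [sub(q, o) for q in ps]
    o2pj = [sub(q2, o2) for q2 in ps2]
    M = [[c[i] for c in opj] for i in range(2)]      # 2x3, columns op_j
    M2 = [[c[i] for c in o2pj] for i in range(2)]    # 2x3, columns o'p'_j
    s = null_vector(M)
    # Particular solution r of M r = op with r_3 = 0 (2x2 Cramer's rule).
    (a, b, _), (d, e, _) = M
    det = a * e - b * d
    if det == 0:
        raise ValueError("op_1 and op_2 are collinear")
    x, y = sub(p, o)
    r = (Fraction(x * e - b * y) / det, Fraction(a * y - d * x) / det, 0)
    base = tuple(o2[i] + sum(M2[i][k] * r[k] for k in range(3)) for i in range(2))
    direction = tuple(sum(M2[i][k] * s[k] for k in range(3)) for i in range(2))
    return base, direction

# Hypothetical privileged points and a point of interest in the first view.
o, ps = (0, 0), [(4, 0), (0, 4), (1, 1)]
o2, ps2 = (1, 0), [(5, 1), (1, 4), (3, 2)]
p = (3, 2)

base, direction = constraint_line(o, ps, o2, ps2, p)

# The true match for affine coordinates b = (1/2, 1/4, 1) must lie on the line.
p_true = (5, Fraction(7, 2))
dx, dy = p_true[0] - base[0], p_true[1] - base[1]
on_line = dx * direction[1] - dy * direction[0] == 0
print(base, direction, on_line)
```

Because `direction` does not depend on p, calling `constraint_line` for different points returns parallel lines, in agreement with the remark following equation 5.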

Different versions of this result have been proposed in the past. Huang and Lee [25]
and Basri [6] derive the same constraint line (which is different from the one presented
here) by different approaches. Huang and Lee assume a rigid transformation
and use that as an algebraic constraint to derive a constraint line. Basri’s derivation
is based on the result, originally developed in [51], that all views of an object
undergoing an affine transformation in space are spanned by a linear combination
of two views. This also shows that rigidity is not required for obtaining the constraint
line. Koenderink and Van-Doorn [28] and Lamdan and Wolfson [29] derive
a particular case of equation 5, the case where $r_3 = 0$.

Figure 1: Two constraint lines that intersect at the corresponding point p'. The
vector n is the normal component of the displacement vector, $n = -(I_t / |\nabla I|^2)\,\nabla I$.
The vectors n, r and the scalars α, $I_t$ are functions of the location p. The vector
s is fixed for all points and can be determined only up to a scale factor.

The displacement vector p' − p varies with the 3D coordinates of P and with the
affine transformation applied to the object in space. Huang and Lee [25] have
shown that the contribution of depth and motion cannot be decoupled from two
orthographic views. The following result shows that a particular case of equation 5
can be realized by a 2D affine transformation defined by the four privileged points,
for which α is a fixed function of the 3D affine coordinates of P, namely, is motion
invariant.

Proposition 3 Four corresponding points, orthographically projected from four
reference points in space, determine a 2D affine transformation A, w that represents
a constraint line in correspondence space, o'p' = A(op) + w + αw, where α is a fixed
function of the affine coordinates of P and is independent of the object’s motion.

Proof: The four corresponding points define three, non-collinear, corresponding
vectors $op_j$ ↔ $o'p'_j$, j = 1, 2, 3. Because of non-collinearity of the vectors, there
exists a unique 2D affine transformation A, w that aligns the corresponding vectors:

$$o'p'_j = A(op_j) + w, \quad j = 1, 2, 3 \tag{6}$$

where A is a 2 × 2 matrix and w is a 2 × 1 vector. Applying the affine transformation
to an arbitrary point p yields the following result:

$$A(op) + w = A\Big(\sum_j b_j\,(op_j)\Big) + w = \sum_j b_j\,(o'p'_j - w) + w = o'p' + \Big(1 - \sum_j b_j\Big) w. \tag{7}$$

Equation 1 was used in the second term, equation 6 in the third term and equation
2 in the last term. After rearrangement we get:

$$p' = [A(op) + o' + w] + \Big(\sum_j b_j - 1\Big) w \tag{8}$$

□
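
The decomposition in the proof can be verified numerically. The Python sketch below is an illustration under assumptions (hypothetical function name `affine_2d`, made-up point coordinates and affine coordinates): it fits A and w to the three corresponding vectors, generates p and p' from equations 1 and 2, and checks that the residual after the nominal transformation is a multiple of w whose scale equals the sum of the affine coordinates minus one.

```python
from fractions import Fraction

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule."""
    d = det3(rows)
    if d == 0:
        raise ValueError("the three vectors are collinear")
    sol = []
    for k in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        sol.append(Fraction(det3(m)) / Fraction(d))
    return sol

def affine_2d(src, dst):
    """A (2x2) and w (2x1) with dst_j = A src_j + w for three vectors src_j."""
    rows = [[x, y, 1] for x, y in src]
    a11, a12, w1 = solve3(rows, [q[0] for q in dst])
    a21, a22, w2 = solve3(rows, [q[1] for q in dst])
    return [[a11, a12], [a21, a22]], (w1, w2)

# Hypothetical vectors op_j and o'p'_j, the second-view origin o',
# and the affine coordinates b of some object point.
op_j = [(4, 0), (0, 4), (1, 1)]
o2p_j = [(4, 1), (0, 4), (2, 2)]
o2 = (1, 0)
b = (Fraction(1, 2), Fraction(1, 4), Fraction(1))

A, w = affine_2d(op_j, o2p_j)

# Equations 1 and 2: op in view 1 and p' in view 2 share the same b.
op = tuple(sum(bj * c[i] for bj, c in zip(b, op_j)) for i in range(2))
p2 = tuple(o2[i] + sum(bj * c[i] for bj, c in zip(b, o2p_j)) for i in range(2))

# Nominal motion A(op) + o' + w and the residual left for the second stage.
nominal = tuple(A[i][0] * op[0] + A[i][1] * op[1] + o2[i] + w[i] for i in range(2))
residual = (p2[0] - nominal[0], p2[1] - nominal[1])
alpha = (residual[0] * w[0] + residual[1] * w[1]) / (w[0] ** 2 + w[1] ** 2)
print(alpha, alpha == sum(b) - 1)
```

Only the scalar `alpha` remains unknown once the four corresponding points are given, which is why a single additional measurement per point, such as the normal flow, suffices.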

The proposition contains two statements: the first is that under an affine coordinate
frame one can derive a constraint line from four corresponding points such
that the remaining degree of freedom α depends only on shape, i.e. is motion
invariant. The second statement is that all of the above is captured by a 2D affine 
transformation derived directly from the four corresponding points. 

The first statement is not new and has been introduced recently by Koenderink and
Van-Doorn [28] by geometrically constructing a constraint line for which $\alpha = b_3$.
An algebraic version of their result, which also shows that it is a particular case of
equation 5 with $r_3 = 0$, is given in appendix 1. Koenderink and Van-Doorn have
also derived the geometrical equivalent of α, showing that it represents the relative
deviation, along the line of sight, of P from the plane passing through the three
reference points, thereby showing that shape is recovered up to depth scaling
and shear.

The geometrical equivalent of α follows directly from the 2D affine representation of
the constraint line by noticing that $\sum_j b_j = 1$ for every point P that is coplanar with
the three reference points. Therefore, the transformation A(op) + w accounts for
the projected motion of a plane (a result well known in projective geometry [42]),
and $\alpha = \sum_j b_j - 1$ represents the deviation of P from that plane. The
geometrical interpretation, much of which was described earlier in [28], can be
summarized in the next proposition.

The following notations are added. The plane passing through $P_1, P_2, P_3$ is referred
to as the reference plane. The point $\tilde{P}$ is the orthographic projection (along the
line of sight) of the point P onto the reference plane. The point $\tilde{P}'$ is the new
location of $\tilde{P}$ following an affine transformation T in space.

Proposition 4 The 2D affine transformation defined in Proposition 3 admits the
following interpretation: the affine vector w is the projection of the vector $\tilde{O}' - O'$
onto the image plane and is perpendicular to the xy projection of the rotation axis
of the transformation T. If T is a similarity transformation (rotation, translation
and scale), then α associated with the point p is equal to:

$$\alpha = \frac{z - \tilde{z}}{\tilde{z}_0 - z_0}$$

where $z, \tilde{z}, z_0$ and $\tilde{z}_0$ are the depth values of $P, \tilde{P}, O$ and $\tilde{O}$, respectively.

Proof: See appendix 2. □

The affine shape parameter α provides, therefore, shape modulo translation in
depth, depth scaling and shear [28]. Translation in depth is unavoidable in orthographic
projection, depth scaling comes from the distance, $\tilde{z}_0 - z_0$, between
the reference point O and the reference plane, and shear comes from the distance,
$z - \tilde{z}$, between object points and the reference plane, whose orientation is unknown.
Therefore, different sets of four reference points are associated with different orientations
of the reference plane and, therefore, give rise to different affine shape.

The question that is dealt with next is whether shape modulo depth scale and 
shear is the most one can obtain from two orthographic views. It has been shown 
by Ullman [49] that when the two views are separated by an infinitesimal angle 
rotation, then shape can be recovered up to an overall depth scaling. The depth 
scaling proposition holds also for planar objects but with an added ambiguity, 
namely, the orthographic velocity field determines exactly two solutions, each up
to a depth scaling [49]. The depth scaling proposition no longer holds under finite 
angle transformations, as shown next, and the best one can achieve is shape up 
to depth scale and shear, namely affine shape. In order to eliminate the shear 
component from the affine shape one has to uniquely recover the equation of the 
reference plane, up to an overall depth scaling, and therefore the more general 
question is whether the depth scaling proposition holds for planar objects under 
finite angle transformations. The result shown below is that the parameters of 
the appropriate constraint line and the equation of the depth scaled plane admit 
a linear one parameter family of solutions. Therefore, one cannot possibly recover 
the plane up to a depth scaling from only two orthographic views separated by a 
finite angle transformation. 

Proposition 5 The constraint line parameters B, s and shape parameters a, b describing
the motion of a planar object z = ax + by + 1, can be determined, up to a
linear one parameter family of solutions, from four corresponding points along two 
orthographic views of the plane undergoing an affine transformation in space. 

Comments. The equation ax + by + 1 determines the depth $\tilde{z}$ of points on the
plane up to a translation and depth scaling, i.e. $\frac{\tilde{z} - z_0}{\tilde{z}_0 - z_0} = ax + by + 1$, where $z_0$ is the
depth of the moving origin, $\tilde{z}_0$ is the depth of the point where the plane intersects
the line of sight, and x, y are coordinates relative to $x_0, y_0$. Therefore, if a, b can be
determined uniquely, then by combining ax + by + 1 with the affine shape α we
obtain the shape of a non-planar object up to depth scaling. The proposition states
that one cannot determine a, b uniquely from just two views.

Proof: We subtract the corresponding point o ↔ o' from both views and the
remaining three corresponding points are used to determine the parameters of the
following constraint line:

$$p'_j = B p_j + (a x_j + b y_j + 1)\,s, \quad j = 1, 2, 3$$

where B is a 2 × 2 matrix and s is a 2 × 1 vector. Given the affine transformation
defined by $p'_j = A p_j + w$ we have from Proposition 3 that $s = \rho w$ for some constant
ρ and that B − A is a projection matrix $\mu[w w^t]$, for some constant μ. We have
therefore,

$$p'_j = (A + \mu\,w w^t)\,p_j + (\rho a x_j + \rho b y_j + \rho)\,w, \quad j = 1, 2, 3,$$

which is reduced to,

$$0 = \mu\,(w^t p_j)\,w + (\rho a x_j + \rho b y_j + \rho - 1)\,w$$

from which we get the following linear system of three equations for the four
unknowns μ, ρa, ρb, ρ:

$$1 = \mu\,(w^t p_j) + \rho a x_j + \rho b y_j + \rho.$$

These equations are linearly independent as long as the three points are not
collinear. The system is underdetermined with any number of corresponding
points, because any additional point must be coplanar with the three reference
points and therefore is an affine combination of these points. □

Propositions 3, 4 and 5 put together show that the 2D affine transformation, recovered
directly from four corresponding points, represents all the information possible
from two orthographic views. A(op) + o' + w accounts for the projected motion of
all points P that are coplanar with $P_1, P_2, P_3$, and the residual for non-coplanar
points is simply a vector along w whose length relative to w represents the shape
of the object up to depth scaling and shear. The next proposition shows that the
magnitude of the residual motion for non-coplanar points is bounded from above
by the depth variation between the surface and the reference plane.

Proposition 6 Let $V_1, V_2$ be two orthographic views produced by a rigid transformation,
and let $\hat{V}_1$ be the view $V_1$ followed by the 2D affine transformation of
Proposition 3. The remaining distance between points $\hat{p}'$ in $\hat{V}_1$ and their corresponding
points p' in $V_2$ is bounded by $|z - \tilde{z}|$, the relative depth between P and
its projection $\tilde{P}$ onto the reference plane.

Proof: we have that $p' - \hat{p}' = \alpha w$ where $\alpha = \frac{z - \tilde{z}}{\tilde{z}_0 - z_0}$. Since $w = \sigma[T(\tilde{O} - O)]$ and
T is a rigid transformation, $|w| \le |\tilde{z}_0 - z_0|$. □
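
The bound can be illustrated with a small numerical experiment. The Python sketch below is an assumption-laden example, not the study's code: it uses a rotation about the vertical axis as the rigid transformation, made-up 3D points, and hypothetical helper names. The 2D affine map is fitted directly to the image positions of the three reference-plane points, which yields the same nominal motion as A(op) + o' + w.

```python
import math

def det3(m):
    """Determinant of a 3x3 matrix given as a list of rows."""
    (a, b, c), (d, e, f), (g, h, i) = m
    return a * (e * i - f * h) - b * (d * i - f * g) + c * (d * h - e * g)

def solve3(rows, rhs):
    """Solve a 3x3 linear system by Cramer's rule (floating point)."""
    d = det3(rows)
    sol = []
    for k in range(3):
        m = [list(r) for r in rows]
        for i in range(3):
            m[i][k] = rhs[i]
        sol.append(det3(m) / d)
    return sol

def fit_affine_2d(src, dst):
    """Return (A, t) with dst_j = A src_j + t for three non-collinear points."""
    rows = [[x, y, 1.0] for x, y in src]
    a11, a12, t1 = solve3(rows, [q[0] for q in dst])
    a21, a22, t2 = solve3(rows, [q[1] for q in dst])
    return ((a11, a12), (a21, a22)), (t1, t2)

def rotate_y(X, theta):
    """Rigid rotation about the vertical (y) axis."""
    x, y, z = X
    c, s = math.cos(theta), math.sin(theta)
    return (c * x + s * z, y, -s * x + c * z)

def project(X):
    """Orthographic projection along the z axis."""
    return (X[0], X[1])

def plane_depth(P1, P2, P3, x, y):
    """Depth of the reference plane along the line of sight through (x, y)."""
    u = [P2[i] - P1[i] for i in range(3)]
    v = [P3[i] - P1[i] for i in range(3)]
    n = (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])
    return P1[2] - (n[0] * (x - P1[0]) + n[1] * (y - P1[1])) / n[2]

# Hypothetical reference-plane points, object point and rotation angle (radians).
P1, P2, P3 = (1.0, 0.0, 0.0), (0.0, 1.0, 0.5), (-1.0, -1.0, 1.0)
P = (0.3, 0.2, 1.5)
theta = 0.4

src = [project(X) for X in (P1, P2, P3)]
dst = [project(rotate_y(X, theta)) for X in (P1, P2, P3)]
A, t = fit_affine_2d(src, dst)

p = project(P)
p_hat = (A[0][0] * p[0] + A[0][1] * p[1] + t[0],
         A[1][0] * p[0] + A[1][1] * p[1] + t[1])
p_true = project(rotate_y(P, theta))
residual = math.hypot(p_true[0] - p_hat[0], p_true[1] - p_hat[1])
bound = abs(P[2] - plane_depth(P1, P2, P3, p[0], p[1]))
print(residual <= bound)
```

For this particular rotation the residual equals the depth deviation scaled by the sine of the rotation angle, so the bound is approached only as the rotation nears a quarter turn.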

Overall scale differences due to translation in depth can be corrected before applying
Proposition 6 (see for example [28]), therefore the result applies to similarity
transformations as well. The importance of this result is that it suggests that
surface shape and motion range can be decoupled, provided that four corresponding
points can be identified. The smaller the depth variation between the surface
and the reference plane, the larger the range of motion that can be detected from
two orthographic views. This can be realized by a two stage computation which
starts with a nominal motion transformation (first term of equation 8), followed
by a residual motion computation (the term αw) with the aid of the brightness
information. The nominal motion transformation provides a first approximation
(determined only by four corresponding points), which is the exact motion for a
planar object, leaving a residual whose magnitude is bounded by the depth variation
between the surface and the reference plane. The final refinement, determining
the residual motion, is provided by the second stage in which the brightness information
is used in the form of a second constraint line, as described earlier. This
point is developed further below, suggesting a general scheme for measurement of
motion.


1.3 Frame of Reference and the Measurement of Motion 

The results of section 1.2 suggest that the measurement of motion is conducted relative
to a frame of reference, in the form of a reference plane, which determines the
direction of motion and the limits on its range (Proposition 6). The range of spatial
displacements is bounded by the depth variation between the moving surface
and the reference plane. This suggests, therefore, that the frame of reference provides
a nominal motion everywhere, which is exact for planar surfaces, by ‘pulling’
or ‘capturing’ the motion of all points that are under its influence. The residual 
motion is later refined by use of local spatio-temporal detectors that implement 
the constant brightness equation, or any other correlation scheme [32, 52, 1], along 
the fixed direction determined by the frame of reference. 

The notion of a frame of reference that precedes the computation of motion may
have some support in human vision literature, although not directly. The phenomenon
of ‘motion capture’ introduced by Ramachandran [38, 39, 40] is suggestive
of the kind of motion measurement presented here. Ramachandran and his
collaborators observed that the motion of certain salient image features (such as
gratings or illusory squares) tends to dominate the perceived motion in the enclosed
area by masking incoherent motion signals derived from uncorrelated random dot
patterns, in a winner-take-all fashion. Ramachandran therefore suggested that motion
is computed by using salient features that are matched unambiguously and
that the visual system assumes that the incoherent signals have moved together
with those salient features [38]. The scheme suggested in this paper may be viewed
as a refinement of this idea. Motion is ‘captured’ in Ramachandran’s sense for the
case of a planar surface in motion, not by assuming the motion of the salient
features but by computing the nominal motion transformation. For a non-planar
surface the nominal motion is only a first approximation which is further refined
by use of spatio-temporal detectors, provided that the remaining residual displacement
is in their range, namely, the surface captured by the frame of reference
is sufficiently flat. In this view the effect of capture attenuates with increasing 
depth of points from the reference plane, and is not affected, in principle, by the 
proximity of points to the salient features in the image plane. 

The motion capture phenomenon also suggests that the salient features that are 
selected for providing a frame of reference must be spatially arranged to provide 
sufficient cues that the enclosed pattern is indeed part of the same surface. In 
other words, not every arrangement of four non-coplanar points, although theoretically
sufficient, is an appropriate candidate for a frame of reference. This point
has also been raised by Subirana-Vilanova and Richards [44] in addressing perceptual
organization issues. They claim that convex image chunks are used as a
frame of reference that is imposed in the image prior to constructing an object description
for recognition. The frame then determines inside/outside, top/bottom,
extraction/contraction and near/far relations that are used for matching image 
constructs to a model. 

Other suggestive data include stereoscopic interpolation experiments by Mitchison
and McKee [33]. They describe a stereogram which has a central periodic
region bounded by unambiguously matched edges. In certain conditions the edges 
impose one of the expected discrete matchings (similar to stereoscopic capture, 
see also [37]). In other conditions a linear interpolation in depth occurred between
the edges violating any possible point-to-point match between the periodic
regions. The linear interpolation in depth corresponds to a plane passing through 
the unambiguously matched points, which supports the idea that correspondence 
starts with the computation of nominal motion, determined by a small number 
of salient unambiguously matched points, and is later refined using short-range 
motion mechanisms. Finally, experiments by Todd and Bressan [45] demonstrate 
that human subjects can determine whether a moving surface is planar from only 
two orthographic views. This may also suggest that the computation of a frame 
of reference in the form of planar nominal motion precedes the final computation 
of motion. 

To conclude, the computational results suggest that a long-range mechanism sets 
up a frame of reference by tracking a selected set of features. The frame provides
a nominal transformation and a matching direction for all other points in
the enclosed region. The remaining residual motion following the nominal transformation
is handled by short-range motion detectors. This view differs from the
classical short-range vs. long-range motion detection in two respects. First it is 
suggested that the two mechanisms interact in a specific way. Second, the range 
of detected motion depends not only on the range of the spatio-temporal detectors 
but also on the three-dimensional shape of the surface, namely, the magnitude of 
the residual motion depends on how close the enclosed surface is to a plane. 


1.4 Implementation 

The use of the constant brightness equation for determining the residual motion
term αw assumes that |αw| is small. In practice, the residual motion is not sufficiently
small everywhere and, therefore, a hierarchical motion estimation framework
is adopted for the implementation. The assumption of small residual motion
is relative to the spatial neighborhood and to the temporal delay between frames;
it is the ratio of the spatial to the temporal sampling step that is required to be
small. Therefore, the smoother the surface the larger the residual motion that can
be accommodated. The Laplacian Pyramid [12] is used for hierarchical estimation
by refining α at multiple resolutions. The rationale is that large residuals at
the resolution of the original image are represented as small residuals at coarser
resolutions, therefore satisfying the requirement of small displacement. The α
estimates from previous resolutions are used to bring the image pair into closer
registration at the next finer resolution.

The particular details of implementation follow the ‘warp’ motion framework suggested
by Bergen and Adelson [9] and by Bergen and Hingorani [10]. In
a nutshell, a synthesized intermediate image is first created by applying the nominal
transformation to the first view. To avoid subpixel coordinates, we actually
compute flow from the second view towards the first view. In other words, the
intermediate frame at location p contains a bilinear interpolation of the brightness
values of the four nearest pixels to the location p' = A(op) + o' + w in the first
view, where the 2D affine parameters A, w were computed from view 2 to view
1. The α field is estimated incrementally by projecting previous estimates at a
coarse resolution to a finer resolution level. Gaps in the estimation of α, because
of vanishing image gradients or other low confidence criteria, are filled in at each
level of resolution by means of membrane interpolation. Once the α field is projected
to the finer level, the displacement field is computed (the vector αw) and
the two images, the intermediate and the second image, are brought into closer
registration. This procedure proceeds incrementally until the finest resolution has
been reached.
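
The construction of the intermediate frame can be sketched as follows. This is only a minimal illustration under stated assumptions, not the Bergen and Adelson implementation: the function names `bilinear_sample` and `warp_nominal` are hypothetical, coordinates are taken as (x, y) with x along image columns, and samples that fall outside the image are clamped to the border.

```python
import numpy as np

def bilinear_sample(img, x, y):
    """Interpolate img at real-valued (x, y) from the four nearest pixels."""
    height, width = img.shape
    x = np.clip(x, 0, width - 1)    # border clamping is an assumption
    y = np.clip(y, 0, height - 1)
    x0 = np.minimum(np.floor(x).astype(int), width - 2)
    y0 = np.minimum(np.floor(y).astype(int), height - 2)
    fx, fy = x - x0, y - y0
    top = (1 - fx) * img[y0, x0] + fx * img[y0, x0 + 1]
    bottom = (1 - fx) * img[y0 + 1, x0] + fx * img[y0 + 1, x0 + 1]
    return (1 - fy) * top + fy * bottom

def warp_nominal(first_view, A, w, o, o_prime):
    """Intermediate frame: at each p, sample the first view at A(op) + o' + w."""
    height, width = first_view.shape
    ys, xs = np.mgrid[0:height, 0:width].astype(float)
    dx, dy = xs - o[0], ys - o[1]                      # the vector op
    xp = A[0, 0] * dx + A[0, 1] * dy + o_prime[0] + w[0]
    yp = A[1, 0] * dx + A[1, 1] * dy + o_prime[1] + w[1]
    return bilinear_sample(first_view, xp, yp)
```

With A the identity and w = 0 the function returns the input image unchanged, which is a convenient sanity check before plugging in estimated parameters.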

1.5 Experimental Results


Experiments were done on real imagery of ‘Ken’, a doll, undergoing rigid rotation, 
mainly around the vertical axis. Four snapshots were taken covering altogether 
about 23 degrees of rotation. The light setting consisted of two point light sources 
located in front of the object, 60 degrees apart from each other. 

Three experiments were conducted: (i) long range motion by incrementally adding 
flow produced by each pair of consecutive images, (ii) long range motion directly, 
and (iii) establishing approximate correspondence using a single corresponding 
point and normal flow information. 

Privileged points were obtained from flow fields generated by the warp motion 
algorithm [9, 10] along points having good contrast at high spatial frequencies, 
e.g. the tip of the eyes, mouth and eye-brows (the locations of those points were
determined manually).

The combination of the particular light setting and the complexity of the object 
make it a challenging experiment for the following two reasons: (i) the object is sufficiently
complex to have cast shadows and specular points, both of which undergo
a different motion than the object itself, and (ii) surface material is dominantly 
Lambertian and therefore, coupled with the light setting, brightness change will 
be induced because of change in viewing angle in addition to the change due to 
motion. 

The results of correspondence in all these experiments are displayed in several
forms. The flow field is displayed to illustrate the stability of the algorithm, indicated
by the smoothness of the flow field. The first image is ‘warped’ using the
flow field to create a synthetic image that should match the second image. The
warped image is displayed in order to check for deformations (or lack thereof).
Finally, the warped image is compared with the second image by superimposing,
or taking the difference of, their edge images that were produced using a Canny
[15] edge detector with the same parameter settings.

Incremental Long Range Motion 

In this experiment, flow was computed independently between each consecutive
pair of images, using a fixed set of four privileged points, and then added up to
form a flow from the first image, Ken1, to the fourth image, Ken4. The rationale
behind this experiment is that because shape is an integral part of computing
correspondence/flow, flow from one consecutive pair to the next should add
up in a consistent manner.

Fig. 2 shows the results on the first pair of images, Ken1 and Ken2, separated by
6° of rotation. The warped image shows no signs of deformation. As expected, the

Figure 2: Results of shape and correspondence for the pair Ken1 and Ken2. First
row: Ken1, Ken2 and the warped image Ken1-2. Second row: edges of Ken1 and
Ken2 superimposed, edges of Ken2 and Ken1-2 superimposed, difference between
edges of Ken2 and Ken1-2. Third row: flow field in the case where α, the shape
constant, is estimated in a least squares manner in a 5 × 5 sliding window, and
flow field when α is computed at a single point (no smoothing).

Figure 3: Three-dimensional plot of the shape constant α.


location of strong cast shadows (one near the dividing hair line) and specular points 
in the warped image do not match those in Ken2. The superimposed edge images 
illustrate that correspondence is accurate, at least up to a pixel accuracy level. The 
flow field is smooth even in the case where no explicit smoothing was done. Finally,
in Fig. 3 the shape constants α are displayed in a three-dimensional plot. One can
clearly see the structure of the head and the bumps and dents corresponding to 
the location of nose, chin and eyes. One cannot recognize, however, the particular 
face from this plot or claim that it is a good rendering of a three-dimensional 
face. The change in brightness due to change in viewing angle is an important 
cue that is not modeled in this framework, and that may explain the inaccuracies 
in recovering shape for the images used here. It is also interesting to note the 
discrepancy between the perceived correspondence, which appears to be accurate, 
and the true correspondence that would have led to accurate shape constants. This 
suggests that good correspondence, in the sense of registration, is more attainable 
than reliable shape descriptors when dealing with real images. More on that, and 
the relation to visual recognition, in section 2. 

Fig. 4 shows the results of adding flow between consecutive pairs computed independently
(using the same four privileged points) to produce flow from Ken1 to
Ken4. Except for the point specularities and the strong shadow at the hair line, the
difference between the warped image and Ken4 is only at the level of differences
in brightness (because of change in viewing angle). No apparent deformation is
observed in the warped image. The flow field is as smooth as the flow from Ken1
to Ken2, implying that the flow was added in a consistent manner.

Long Range Motion 


Figure 4: Results of adding flow from Ken1 to Ken4. First row: Ken1, Ken4 and
the warped image Ken1-4. Second row: edges of Ken1, Ken4 and edges of both
superimposed. Third row: edges of Ken1-4, edges of Ken4 and edges of Ken1-4
superimposed, flow field from Ken1 to Ken4 (scaled for display).








The two-stage scheme for measuring motion — nominal motion followed by a 
short-range residual motion detection — suggests that long-range motion can be 
handled in an area enclosed by the privileged points. The restriction of short-range 
motion is replaced by the restriction of limited depth variation from the reference 
plane. As long as the depth variation is limited, then correspondence should be 
obtained regardless of the range of motion. Note that this is true as long as we are
sufficiently far away from the object’s bounding contour: the larger the rotational
component of motion, the larger the number of points that go in and out of
view. Therefore, we should not expect good correspondence at the boundary. The
claim tested in the following experiment is that under long range motion,
correspondence is accurate in the region enclosed by the frame of reference, i.e.
at points that are sufficiently far away from the boundary.

Fig. 5 shows the results of computing flow directly from Ken1 to Ken4. Note the
effect of the nominal motion transformation. The nominal motion brings points
closer together inside the frame of reference; points near the boundary are taken
farther apart from their corresponding points because of the large depth difference
between the corresponding object points and the reference plane. The warped
image looks very similar to Ken4 except near the boundary of the object. The
deformation there may be due to both the relatively large residual displacement,
remaining after nominal motion was applied, and to the repetitive intensity structure
of the hair; the farther we go from the reference plane the larger the residual
displacement αw. Therefore it may be that the frequency of the hair structure
caused a misalignment at some level of the pyramid which was propagated.

Approximate Correspondence With a Single Privileged Point 

The 2D affine transformation A, w derived from four corresponding points (Proposition
3) describes a constraint line, which together with the constant brightness
equation, determines correspondence everywhere else. Also, as shown in appendix
1, any 2D affine transformation that aligns three image points with their corresponding
points can be used to define the constraint line, together with one
additional corresponding point. It may therefore be possible to look for an affine
transformation A, w that approximately aligns 3 points, without actually using 3
corresponding points. Bachelder and Ullman [4] show that measurements of normal
flow (see footnote 5) along at least 6 points determine a 2D affine transformation.
Burt et al. [13, 14] show a similar result by deriving a 2D affine transformation directly
from the instantaneous brightness measurements using the constant brightness
equation.

Footnote 5: the component of p' - p along the image gradient, or along the normal
to the contour passing through p.

Figure 5: Results of computing long-range flow from Ken1 to Ken4. First row:
Ken1, Ken4 and the warped image Ken1-4. Second row: edges of Ken1 and Ken4
superimposed, edges of Ken4 and edges of Ken1-4. Third row: edges of Ken4
superimposed on edges of the nominal transformed Ken1, edges of Ken4 and Ken1-4
superimposed, and difference between edges of Ken4 and edges of Ken1-4.

Following Burt et al. we look for a 2D affine transformation A, w that minimizes
the total squared error of the constant brightness equation, in which A(op) + o' +
w - p is substituted for the unknown velocity. Algebraically, this takes the form:

$$\min_{A,w} \sum_i \left( \nabla I \cdot \left( A(op_i) + o' + w - p_i \right) + I_t \right)^2$$

where o ↔ o' is a given privileged point. Note that if the area of summation
corresponds to a planar patch, then A, w will represent the motion of some plane
moving with the object, and therefore accurately aligns at least 3 points with their
corresponding points. For a non-planar patch this is not guaranteed, and A, w
will only approximately align at least 3 points.
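
Because the objective is linear in the six unknowns (the four entries of A and the two of w), a single-resolution version of the minimization reduces to ordinary least squares. The sketch below illustrates this under stated assumptions: the function name `fit_affine_brightness` is hypothetical, the spatial and temporal derivatives Ix, Iy, It are assumed to be sampled already at the points p_i, and the hierarchical refinement used in the experiments is omitted.

```python
import numpy as np

def fit_affine_brightness(Ix, Iy, It, p, o, o_prime):
    """Least-squares A (2x2) and w (2,) for the brightness-constancy objective.

    Ix, Iy, It: (N,) image derivatives at the points p (N, 2).
    o, o_prime: the privileged point in the two views.
    """
    d = p - o                                   # vectors op_i
    # Unknown order: a11, a12, a21, a22, w1, w2
    M = np.column_stack([Ix * d[:, 0], Ix * d[:, 1],
                         Iy * d[:, 0], Iy * d[:, 1],
                         Ix, Iy])
    c = o_prime - p                             # known part of the velocity
    rhs = -(It + Ix * c[:, 0] + Iy * c[:, 1])
    theta, *_ = np.linalg.lstsq(M, rhs, rcond=None)
    return theta[:4].reshape(2, 2), theta[4:]
```

The system needs at least six points with image gradients in varied directions; a region of uniform gradient direction leaves the solution underdetermined.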

A single region, covering the entire face, was chosen in order to test the accuracy
of this scheme on non-planar patches. The affine parameter estimation was performed
in a hierarchical framework, and a single privileged point was then chosen
(the tip of the left eye). The results of aligning Ken1 and Ken2 are perceptually
identical to the four privileged point scheme. The results of aligning Ken1 and
Ken3, separated by 14° of rotation, are shown in Fig. 6. Note that although results
differ between the four point scheme and the single point scheme, the quality
is very similar.


2 Object Recognition and Structure from Motion

The geometrical aspect of visual recognition can be viewed as a problem of compensating
for changes in the image induced by changing view positions [50, 26].
Under this view, the visual system must confront similar issues to those dealt with
in SFM, albeit in a more implicit manner: one is more concerned with factoring
out the effects of shape and viewing transformation on the changing image
than with recovering them.

Three concepts, that have been recently introduced, seem to play an important
role in this view. The first concept is the equivalence between the process of compensating
for the change in the image and the process of generating the image
from a 2D model [51, 18]. For instance, Ullman and Basri [51] have shown that
all possible views of an object undergoing a similarity transformation in space (rotation,
translation and scale) are spanned by the linear combination of three views of the
object (two in the case of affine transformation in space, see also result by Poggio
[35]). Therefore, any process that can generate a novel view from a 2D model
is relevant for purposes of recognition. The second concept, introduced also by
Ullman and Basri, is that shape information is equivalent to full correspondence
among a small set of model views. One therefore does not need to explicitly recover
shape and view transformation (motion) in order to generate a novel view:
four corresponding points between the novel view and the model views are sufficient
for generating the entire view. Finally, the third concept is the distinction
between objects with sharp bounding contours and objects with smooth bounding
contours [7, 51]. An object with a smooth bounding contour, such as an ellipsoid,
does not induce a one-to-one mapping between the object’s bounding contours and
the projected silhouette. Furthermore, the bounding contours that generate the
silhouette move constantly on the object as the viewing position changes. This
case may, therefore, require special attention in generating novel views. Ullman
and Basri have shown that for this case the number of views required to approximately
span all views undergoing a similarity transformation is five (three in the
case of affine transformation in space).

Figure 6: Comparing the four points scheme to the single privileged point scheme.
First row: Ken1, Ken3 and their superimposed edge images. Second row: edges of
Ken3 and Ken1-3 (the warped image) superimposed using four privileged points,
edges of Ken3 and Ken1-3 superimposed using a single privileged point, the difference
between edges of Ken1-3 produced by the four point scheme and the single
point scheme.

The results derived in section 1 are shown to be relevant to visual recognition in 
the context of the three concepts described above. In particular, (i) the result that 
two model views in full correspondence together with four corresponding points 
with a novel view are sufficient to generate the entire view [51, 35] is rederived 
using tools from section 1, (ii) a single view can generate novel views taken under 
similar illumination conditions undergoing limited changes of view position, and 
(iii) novel views of objects with smooth boundaries can be generated, to a good 
approximation, from two views in full correspondence. 


2.1 Recognition from a Single View 

The main result, derived in section 1, is that correspondence can be recovered from 
two pictures, taken under similar illumination conditions, of an object undergoing 
an affine transformation in space. The range of allowed viewing transformation 
was shown to be limited by the structure of the object — the smaller the depth 
variation, in the region of four corresponding points, the larger the range of viewing 
transformations. In the context of the first concept, this result is equivalent to
saying that a novel view can be generated from a single model view (picture) and 
four corresponding points, provided the model image and the input image are 
taken under similar illumination conditions and with a restricted range of viewing 
transformations. 

One straightforward extension is to treat regions of the object as locally flat, and 
by that to increase the range of viewing transformations for the entire object. This 
can be implemented by imposing a triangulation on a set of more than four corresponding
points [26]. The triangulation divides the image into regions, each with
three corresponding points, within which the correspondence method discussed in
section 1 can be applied (the fourth corresponding point can be shared among all
triangles).


2.2 Recognition from Two views: Objects with Sharp 
Boundaries 

The basic result, derived by Ullman and Basri [51] and by Poggio [35], is that two 
model views with full correspondence are sufficient to generate, using the linear 
combination scheme, a novel view given four corresponding points between the 
novel view and the model views. Ullman and Basri also pointed out, that with only 
two model views one cannot distinguish between a non-rigid linear transformation 
and a rigid transformation of the object. 

There are two ways, both straightforward, to re-derive this result in the framework
of recovering affine shape. The first derivation follows directly from equations 1
and 2, that for convenience are reproduced below:

$$op = \sum_{j=1}^{3} b_j \, (op_j)$$

$$o'p' = \sum_{j=1}^{3} b_j \, (o'p'_j).$$

The affine coordinates can be recovered for every corresponding point, and therefore
can be recovered for all points in model view V_1 given full correspondence
with model view V_2. Since the affine coordinates are invariant under any affine
transformation in space, then given a novel view V and four corresponding points
with V_1 and V_2 one can recover the affine coordinates from the known correspondence
V_1 ↔ V_2 and use them to generate V from V_1 (or from V_2). Incidentally,
this also shows that 1.5 views are sufficient [51, 35] because 2 views provide an
over-determined system for solving for the affine coordinates.

One can use a more practical method for generating a novel view by using the
constraint line derived in Proposition 3. This can be done in the following way.
Let p, p' and p'' be the image coordinates of the point P in the two model views
V_1, V_2 and the third novel view V, respectively. Given four corresponding points
along the three views one can construct the constraint line, equation 8, between
V_1, V_2 and between V_1, V. We take advantage of the separation of shape and motion
in equation 8 by noticing that the scale factor α is the same along the constraint
line from p to p' and from p to p''. We therefore can find α from the known
correspondence p, p' and use that to find the corresponding point in the third view
p''. Since α is invariant under affine transformations in space one cannot distinguish
between a non-rigid linear transformation and a rigid transformation of the object.
Also, from Propositions 3 and 4, the transformation between the two model views
should be other than a pure rotation around the line of sight (w = 0 in that case).
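
The prediction step can be sketched in a few lines. The sketch assumes that the constraint line has the form o'p' = A(op) + w + αw, with A, w chosen so that o'p'_j = A(op_j) + w for the three privileged points other than o; this form is inferred from the way A, w and α are used in the text and in Appendix 2, and the function names are hypothetical.

```python
import numpy as np

def affine_from_four_points(o, pts, o2, pts2):
    """Solve o'p'_j = A(op_j) + w, j = 1..3, for A (2x2) and w (2,).

    o, o2: the privileged origin in the two views; pts, pts2: (3, 2) arrays
    holding the other three privileged points in the two views.
    """
    X = np.vstack([(pts - o).T, np.ones(3)])    # 3x3, columns [op_j; 1]
    Y = (pts2 - o2).T                           # 2x3, columns o'p'_j
    M = np.linalg.solve(X.T, Y.T).T             # M = Y X^{-1} = [A | w]
    return M[:, :2], M[:, 2]

def shape_constant(p, p2, o, o2, A, w):
    """alpha such that o'p' = A(op) + w + alpha * w, for points p (N, 2)."""
    residual = (p2 - o2) - (p - o) @ A.T - w
    return residual @ w / (w @ w)

def predict_view(p, alpha, o, o3, A3, w3):
    """Predict p'' in the novel view from p and the recovered alpha."""
    return o3 + (p - o) @ A3.T + w3 + alpha[:, None] * w3
```

In use, `affine_from_four_points` is called twice (model view 1 to model view 2, and model view 1 to the novel view), `shape_constant` is evaluated on the known model correspondence, and `predict_view` reuses those values with the second pair of parameters. The division by w · w makes the degenerate case noted above (w = 0) explicit.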


2.3 Recognition from Two views: Objects with Smooth 
Boundaries 

In the case of objects with smooth boundaries, the correspondence between the 
two model views at and near the silhouette no longer relates to the true affine 
shape parameters at these points. This is because any two corresponding points 
along the silhouette are projected from different object points. 

The fact that the shape parameter α that is recovered from a silhouette point in
view V_1 and its corresponding silhouette point in V_2 is not equal to the shape parameter
associated with any of the object points projecting to the two corresponding
points may work to our advantage. The reason is that the shape parameter α'
that is required to correctly generate the same silhouette point in a novel view V
also does not relate to a true shape parameter, and therefore it may be expected
that α ≈ α'. It is important to note that as long as the four privileged points are
true corresponding points (i.e. not on the silhouette), then the nominal motion
transformation and the direction of the constraint line w are correct for all points,
including those at and near the bounding contour; it is only the shape parameter
α that may be inaccurate at these points.

If indeed α is a good approximation to α', then one can use the same method for
generating novel views as that used for objects with sharp boundaries, with the
same number of model views.

The following section analyzes the accuracy of this method under the assumption 
of pure rotation around y axis (rotation around z axis can be neglected), reference 
plane ortho-parallel, and that rim points are on locally spherical patches. Under 
these assumptions, the error relative to the radius of curvature at the rim is shown 
to be typically less than 3% for relatively large rotations (30 degrees) and less than 
1% for a 15 degree rotation. Experimental results follow. 


2.4 Analysis of the Prediction Method 

For the purpose of analysis, one can ignore rotation around the z axis, translation, 
and scaling. I further make the following simplifications: (i) reference plane is 
ortho-parallel to the image plane, (ii) rotation is only around the y axis, and (iii) 
the boundary points projecting to the silhouettes are on locally spherical patches, 
with radius r. 

Figure 7: Cross section of a sphere perpendicular to the vertical axis. See text for
reference.

Fig. 7 shows a cross section of a sphere, that is perpendicular to the y axis,
and a point P on its rim. The point P'_s is the new rim point following a rotation
of φ degrees around the y axis. Let the reference plane be at a distance r sin β
from the center of the sphere, and let the privileged point O be located on the
sphere such that the distance z̃_o - z_o = rρ, for some constant ρ. We therefore
have that |w| = rρ sin φ. The nominal transformation associated with P is the
projection of P̃', which is equal to |A(op) + w| = r(cos φ - sin φ sin β). The shape
parameter α scales w to satisfy the equation (only the x-component is displayed)
A(op) + w + αw = r, and therefore

$$\alpha = \frac{1 - \cos\phi + \sin\phi \sin\beta}{\rho \sin\phi}.$$

We use α to predict the new location of the corresponding point p' resulting from
some other angle of rotation ψ around the y axis. The motion component and the
length of w corresponding to the new angle ψ are r(cos ψ - sin ψ sin β) and rρ sin ψ,
respectively. Noting that an exact correspondence will set p' = r for any angle ψ,
and that the sin β terms cancel, the error, relative to the radius r, is therefore:

$$\epsilon = \cos\psi + \frac{(1 - \cos\phi)\sin\psi}{\sin\phi} - 1.$$

Figure 8: Percentage of relative error as a function of the angle between the model
views, assuming worst case interpolation error. 


The worst case error for interpolation, i.e., ψ < φ, is when ψ = arctan((1 - cos φ)/sin φ) = φ/2.

Fig. 8 shows the percentage of error as a function of φ, taking ψ at its worst
case value. We see that the more distant the two model views, the
larger the relative error. Also, the absolute error increases with the radius r, that is,
the lower the curvature along the line of sight the larger the absolute
error. The expected worst case absolute errors for generating views of ‘Ken’, given
Ken1 and Ken4 as the model views, are 1.5 pixels for Ken2 and 2 pixels for Ken3.
This is because the projected radius is about 100 pixels, φ = 23° and ψ = 6°, 9° for
Ken2 and Ken3, respectively.
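
The figures quoted above can be reproduced with a few lines of arithmetic. The sketch below writes φ for the rotation between the two model views and ψ for the rotation of the novel view, both in degrees, and evaluates the relative error cos ψ + (1 - cos φ) sin ψ / sin φ - 1 obtained from the spherical-patch analysis; the function names are hypothetical.

```python
import math

def relative_error(phi_deg, psi_deg):
    """Silhouette error relative to the radius r, for model angle phi and novel angle psi."""
    phi, psi = math.radians(phi_deg), math.radians(psi_deg)
    return math.cos(psi) + (1 - math.cos(phi)) * math.sin(psi) / math.sin(phi) - 1

def worst_case_error(phi_deg):
    """Largest interpolation error, attained at psi = phi / 2."""
    return relative_error(phi_deg, phi_deg / 2)

radius = 100.0  # projected radius in pixels, as quoted in the text
for psi in (6, 9):
    # Roughly 1.6 and 2.0 pixels, in line with the 1.5 and 2 pixel figures above.
    print(psi, round(radius * relative_error(23, psi), 2))
```

Since (1 - cos φ)/sin φ = tan(φ/2), the same quantity can be written as cos(ψ - φ/2)/cos(φ/2) - 1, which makes the maximum at ψ = φ/2 immediate.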

Experimental results shown below confirm these estimates. Full correspondence
was obtained using the incremental flow estimation described earlier (results that
were shown in Fig. 4). Four corresponding points were manually chosen among
the two model views and between Ken2 and Ken3. Fig. 9 shows the results of
generating Ken2 and Ken3. As expected, the errors in the silhouette of Ken2 are
smaller than those in Ken3. This is because Ken3 is further apart from the model
views than Ken2, as illustrated in the figure. The errors along the silhouette of
Ken2 are less than 1 pixel for most of the points, and along other silhouette points
the error is between 1 and 2 pixels. The errors along the silhouette of Ken3 are
between 1 and 2 pixels.

Figure 9: Generating novel views from two model views, Ken1 and Ken4. The top
row shows the results of generating Ken2 from Ken1, and the bottom row shows the
results of generating Ken3 from Ken4. The first two images illustrate the distance
between the novel view and the two model views, the third image is an overlay
of the edges of the original view and the predicted view, the fourth image is the
generated view.

In conclusion, a method for generating novel views from two model views was
suggested. The method is based on the principle of shape and motion separation
derived in Proposition 3. The method is accurate for objects with sharp bounding
contours, and can handle, to a good approximation, objects with smooth bounding
contours. An analysis followed by experimental results on images of ‘Ken’
illustrates the accuracy of the method.

In comparing the affine-shape recognition scheme to the linear combination scheme 
of Ullman and Basri, one sees the following tradeoff: the linear combination scheme 
is theoretically more accurate in the case of objects with smooth boundaries (see 
Basri [5], for analysis) than the method suggested here. This is at the expense of 
having more views in the model, and having to include points along the bounding 
contour in the sample of privileged points (otherwise the linear coefficients are not 
unique). The method suggested here does not require privileged points along the 
boundary, which makes it easier to find a small number of reliable points to use 
for recognition. 


3 Summary 

The paper presented a model for recovering affine shape and correspondence from 
two orthographic views for the purposes of structure from motion and object 
recognition. It was shown that it is possible to recover shape and full correspondence/flow
simultaneously, by using the instantaneous change in brightness,
together with four corresponding points, as an integral part of the computational
model. It was shown that a 2D affine transformation, derived directly from the
four corresponding points, represents a constraint line that captures both the affine
shape and the motion of a plane, that serves as a frame of reference, imposed on
the object (Proposition 3). Based on that result, it was suggested that the measurement
of motion starts by imposing a frame of reference that is defined by a
small number of salient, unambiguously matched, features in the image. The motion
of the frame ‘captures’ the motion of the remaining image points and takes
them part of the way towards their corresponding points. Motion is then refined 
using local spatio-temporal detectors that are tuned along a known direction in 
the image. The magnitude of the refinement is bounded by the depth variation 
between the surface and the frame (Proposition 6). 

Those results were shown to apply to visual recognition by generating novel views 
of sharp and smooth boundary objects from two model views, or from a single 
view but with a restricted viewing transformation range. 

Acknowledgments 

Part of this work was done during my visit with the vision group headed by Peter
Burt at David Sarnoff Research Center, Princeton NJ. Thanks to all members
of the group, including Padmanabhan Anandan, Jim Bergen, Keith Hanna, Neil
Okamoto and Rick Wildes for many discussions and for providing an inspiring
atmosphere to work in. Special thanks to P. Anandan for his suggestions and
contribution to Proposition 3 and to J. Bergen for his suggestion to use affine
motion methods [13, 14] to substitute privileged point information.

Thanks to Tomaso Poggio and Whitman Richards for helpful comments and suggestions
throughout this work. Thanks to Eric Grimson, Sandy Wells, David
Jacobs and Todd Cass for comments on earlier drafts of this manuscript. Thanks
to my advisor Shimon Ullman for keeping up with my long notes, sent over the
bitnet, for his careful reading of previous drafts and for insightful comments.





Appendix 1: Alternative Algebraic Form of the Constraint Line 


Koenderink and Van-Doorn [28] show that the constraint line can be derived in
two stages: first p is represented in the 2D affine frame defined by 3 of the corresponding
points, and then the fourth point is used to find the third affine vector
by subtracting its corresponding point from its projection in the 2D affine frame.

This result is derived below in a single step algebraic proof which also shows that
the resulting constraint line is a particular case of equation 5 in which r_3 = 0. Also
shown is the result that a 2D affine transformation aligning at least 3 points with
their corresponding points can be used, together with an additional corresponding
point, to derive a constraint line of the type introduced by Koenderink and Van-Doorn.

Let $A$ and $w$ (not the same $A$, $w$ as in Proposition 3) be the 2D affine transformation 
that aligns three of the corresponding points $o, p_1, p_2$, i.e. $o' = Ao + w$ and 
$p'_j = Ap_j + w$, $j = 1, 2$. Subtracting the first equation from the other two gives: 

$$o'p'_1 = A(op_1)$$
$$o'p'_2 = A(op_2)$$

and therefore $A$ is the matrix $[o'p'_1, o'p'_2][op_1, op_2]^{-1}$. Let $p_3$ be the fourth privileged 
point, and let $p$ be an arbitrary point. We therefore have: 

$$o'p' = \sum_{j=1}^{3} b_j (o'p'_j) = b_1 A(op_1) + b_2 A(op_2) + b_3 (o'p'_3) + b_3 A(op_3) - b_3 A(op_3)$$
$$= A(op) + b_3 (o'p'_3 - A(op_3))$$

and considering that $w = o' - Ao$ we get: 

$$p' = Ap + w + b_3 (p'_3 - Ap_3 - w)$$

In this case the third frame vector is not in the direction of $w$ but, as Koenderink and 
Van-Doorn noted, is the result of subtracting the projected motion of $P_3$ onto the 
reference plane passing through $O, P_1, P_2$ from $p'_3$. 
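
The final identity can be checked numerically. The following is a minimal Python sketch on synthetic data, not part of the original derivation: the 3D points, the affine transformation (`M`, `t`), the coefficients `b` and all helper names are arbitrary choices made for illustration. It builds $P$ from $OP = \sum_j b_j OP_j$, projects both views orthographically, and compares the observed $p'$ with $Ap + w + b_3 (p'_3 - Ap_3 - w)$.

```python
# Numerical check of p' = A p + w + b_3 (p'_3 - A p_3 - w) on synthetic data.
# All numeric values and helper names are illustrative assumptions.

def apply_affine_3d(M, t, P):
    """Apply the 3D affine map P -> M P + t."""
    return tuple(sum(M[i][k] * P[k] for k in range(3)) + t[i] for i in range(3))

def project(P):
    """Orthographic projection onto the image (xy) plane."""
    return (P[0], P[1])

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def add(a, b):
    return tuple(x + y for x, y in zip(a, b))

def scale(k, a):
    return tuple(k * x for x in a)

def mat_vec(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])

def affine_from_pairs(u1, u2, v1, v2):
    """Return the 2x2 matrix A = [v1, v2][u1, u2]^{-1}, so A u1 = v1 and A u2 = v2."""
    det = u1[0] * u2[1] - u2[0] * u1[1]
    return (
        ((v1[0] * u2[1] - v2[0] * u1[1]) / det, (v2[0] * u1[0] - v1[0] * u2[0]) / det),
        ((v1[1] * u2[1] - v2[1] * u1[1]) / det, (v2[1] * u1[0] - v1[1] * u2[0]) / det),
    )

def constraint_check():
    # Four reference points and the affine coordinates b of an arbitrary point P.
    O, P1, P2, P3 = (0.5, 0.2, 1.0), (1.5, 0.1, 2.0), (0.4, 1.3, 0.5), (1.0, 1.0, 3.0)
    b = (0.5, -1.2, 2.0)
    P = O
    for bj, Pj in zip(b, (P1, P2, P3)):
        P = add(P, scale(bj, sub(Pj, O)))  # OP = sum_j b_j OP_j

    # An arbitrary 3D affine transformation relating the two views.
    M = ((1.0, 0.2, 0.5), (-0.3, 0.9, 0.4), (0.1, -0.6, 1.1))
    t = (0.3, -0.7, 2.0)

    points = (O, P1, P2, P3, P)
    o, p1, p2, p3, p = [project(Q) for Q in points]
    o2, q1, q2, q3, q = [project(apply_affine_3d(M, t, Q)) for Q in points]

    # 2D affine transformation aligning o, p_1, p_2 with o', p'_1, p'_2.
    A = affine_from_pairs(sub(p1, o), sub(p2, o), sub(q1, o2), sub(q2, o2))
    w = sub(o2, mat_vec(A, o))

    # Third frame vector p'_3 - A p_3 - w, then the predicted p'.
    third = sub(sub(q3, mat_vec(A, p3)), w)
    predicted = add(add(mat_vec(A, p), w), scale(b[2], third))
    return predicted, q

predicted, actual = constraint_check()
print(predicted, actual)
```

Under these assumptions the two printed points agree up to floating-point error. Varying `b[2]` alone moves the predicted point along the direction of the third frame vector.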

Appendix 2: Proof of Proposition 4 


Proposition 4 The 2D affine transformation defined in Proposition 3 admits the 
following interpretation: the affine vector $w$ is the projection of the vector $\tilde{O}' - O'$ 
onto the image plane and is perpendicular to the xy projection of the rotation axis 
of the transformation $T$. If $T$ is a similarity transformation (rotation, translation 
and scale), then $\alpha$ associated with the point $p$ is equal to: 

$$\alpha = \frac{z - \tilde{z}}{\tilde{z}_0 - z_0}$$

where $z, \tilde{z}, z_0$ and $\tilde{z}_0$ are the depth values of $P, \tilde{P}, O$ and $\tilde{O}$, respectively. 

Proof: Geometrically, any point $P$ that is coplanar with $P_1, P_2, P_3$ can be 
represented as a convex combination of the three vectors $OP_j$, therefore $\alpha$ of the point 
$p$ is equal to 0. This in particular shows, in a very simple manner, that a 2D affine 
transformation accounts for the projected motion of a plane [42]. This also proves 
that $\alpha$ represents the deviation of $P$, along the line of sight, from the reference 
plane. 

The vector $OP$ can be represented as the sum of the following two vectors: 

$$OP = P - O = (P - \tilde{P}) + (\tilde{P} - O).$$

Applying an affine transformation $T$ followed by an orthographic projection yields: 

$$o'p' = \sigma[T(P - \tilde{P})] + \sigma[T(O\tilde{P})].$$

Since $\tilde{P}$ is on the reference plane, we have that $\sigma[T(O\tilde{P})] = A(op) + w$. We 
therefore have the following: 

$$\sigma[T(P - \tilde{P})] = \alpha w$$

where $\alpha = b_3 - 1$. Using the same reasoning, we get that 

$$w = \sigma[T(\tilde{O} - O)].$$

We therefore see that $w$ is the projection of $\tilde{O} - O$ at the second time instance. 
Because $\tilde{O} - O$ is parallel to $P - \tilde{P}$, we may represent the deviation of $P$ from the 
reference plane as a scale factor of $\tilde{O} - O$. Furthermore, since $\tilde{O} - O$ is along the 
line of sight, the projection $\sigma[T(\tilde{O} - O)]$ is perpendicular to the xy projection 
of the rotation axis. 

If $T$ is a rigid transformation, possibly followed by uniform scaling, then the 
relationship between $P - \tilde{P}$ and $\tilde{O} - O$ remains fixed, before and after $T$ is applied, 
and therefore $\alpha$, the scale factor, becomes: 

$$\alpha = \frac{z - \tilde{z}}{\tilde{z}_0 - z_0}$$

where $z, \tilde{z}, z_0$ and $\tilde{z}_0$ are the depth values of $P, \tilde{P}, O$ and $\tilde{O}$, respectively. [] 
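
The depth-ratio interpretation can be illustrated numerically. The sketch below is hedged and rests on assumptions that do not come from this section: it takes Proposition 3 to define $A$, $w$ by $o'p'_j = A(op_j) + w$ for $j = 1, 2, 3$, writes $\tilde{P}$ for the point of the reference plane through $P_1, P_2, P_3$ on the line of sight of $P$, and uses a rotation about the y axis (an axis lying in the image plane) followed by uniform scaling and translation. All numeric values and helper names are illustrative.

```python
import math

# Illustrative check of Proposition 4 on synthetic data (assumed setup, see lead-in).

def sub(a, b):
    return tuple(x - y for x, y in zip(a, b))

def project(P):
    """Orthographic projection onto the image (xy) plane."""
    return (P[0], P[1])

def mat_vec(A, v):
    return (A[0][0] * v[0] + A[0][1] * v[1], A[1][0] * v[0] + A[1][1] * v[1])

def affine_from_pairs(u1, u2, v1, v2):
    """Return the 2x2 matrix A with A u1 = v1 and A u2 = v2."""
    det = u1[0] * u2[1] - u2[0] * u1[1]
    return (
        ((v1[0] * u2[1] - v2[0] * u1[1]) / det, (v2[0] * u1[0] - v1[0] * u2[0]) / det),
        ((v1[1] * u2[1] - v2[1] * u1[1]) / det, (v2[1] * u1[0] - v1[1] * u2[0]) / det),
    )

def similarity(P, theta, s, t):
    """Rotate about the y axis by theta, scale uniformly by s, translate by t."""
    x, y, z = P
    c, sn = math.cos(theta), math.sin(theta)
    return (s * (c * x + sn * z) + t[0], s * y + t[1], s * (-sn * x + c * z) + t[2])

def depth_on_plane(x, y, P1, P2, P3):
    """Depth of the point on the plane through P1, P2, P3 that projects to (x, y)."""
    u, v = sub(P2, P1), sub(P3, P1)
    n = (u[1] * v[2] - u[2] * v[1], u[2] * v[0] - u[0] * v[2], u[0] * v[1] - u[1] * v[0])
    return P1[2] - (n[0] * (x - P1[0]) + n[1] * (y - P1[1])) / n[2]

def proposition_4_check():
    O, P1, P2, P3 = (0.0, 0.0, 0.0), (1.0, 0.0, 1.0), (0.0, 1.0, 2.0), (1.0, 1.0, 4.0)
    P = (2.0, -1.0, 3.0)
    theta, s, t = 0.4, 1.3, (0.5, -0.2, 2.0)

    points = (O, P1, P2, P3, P)
    o, p1, p2, p3, p = [project(Q) for Q in points]
    o2, q1, q2, q3, q = [project(similarity(Q, theta, s, t)) for Q in points]

    # Assumed form of Proposition 3: o'p'_j = A(op_j) + w for j = 1, 2, 3.
    a1, a2, a3 = sub(p1, o), sub(p2, o), sub(p3, o)
    c1, c2, c3 = sub(q1, o2), sub(q2, o2), sub(q3, o2)
    A = affine_from_pairs(sub(a2, a1), sub(a3, a1), sub(c2, c1), sub(c3, c1))
    w = sub(c1, mat_vec(A, a1))

    # alpha measured in the image: o'p' - A(op) - w = alpha w.
    r = sub(sub(sub(q, o2), mat_vec(A, sub(p, o))), w)
    alpha_image = (r[0] * w[0] + r[1] * w[1]) / (w[0] * w[0] + w[1] * w[1])

    # alpha from depths: (z - z~) / (z~_0 - z_0).
    z_tilde = depth_on_plane(P[0], P[1], P1, P2, P3)
    z0_tilde = depth_on_plane(O[0], O[1], P1, P2, P3)
    alpha_depth = (P[2] - z_tilde) / (z0_tilde - O[2])
    return alpha_image, alpha_depth, w

alpha_image, alpha_depth, w = proposition_4_check()
print(alpha_image, alpha_depth, w)
```

In this setup the two values of $\alpha$ coincide, and $w$ has a vanishing y component, i.e. it is perpendicular to the image projection of the chosen rotation axis. The check is specific to an in-plane rotation axis and is not a test of the general statement.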